Text to put somewhere

We hope this guide is a useful resource that helps you accomplish your existing research ideas or helps you come up with entirely new ideas. Please don’t hesitate to contact Aaron Williams () or Ajjit Narayanan () if you have any questions about this guide or need any assistance with R. And make sure to join our slack channel at #r-users-group!

Rob Pitingolo () and the Urban Institute Mapping Users Group are also a fantastic resource to learn more about maps.

Introduction


This guide will teach you the concepts and R libraries you will need for mapping and geospatial analysis in R. This is a long guide, so if you need something specific, we encourage you to scroll to the appropriate section using the Table of Contents on the left. If you just want to see code examples of maps you can copy and paste, go to the Map Gallery (coming soon!).

Should this be a map?

The Urban Institute Data Visualization Style Guide offers some blunt but useful suggestions for maps:

Just because you’ve got geographic data, doesn’t mean that you have to make a map. Many times, there are more efficient storyforms that will get your point across more clearly. If your data shows a very clear geographic trend or if the absolute location of a place or event matters, maps might be the best approach, but sometimes the reflexive impulse to map the data can make you forget that showing the data in another form might answer other—and sometimes more important—questions.

So we would encourage you to think critically beofore making a map

Why map with R?

R can have a steeper learning curve than point-and-click tools - like QGIS or ArcGIS - for geospatial analysis and mapping. But creating maps in R has many advantages including:

  1. Reproducibility: By creating maps with R code, you can easily share the outputs and the code that generated the output with collaborators, allowing them to replicate your work and catch errors easily.

  2. Iteration: With point and click software like ArcGIS, making 50 maps would be 50 times the work/time. But using R, we can easily make make many iterations of the same map with a few changes to the code.

  3. Easy Updates: Writing code provides a roadmap for others (and future you!) to quickly update parts of the map as needed. Say for example a collabarator wanted to change the legend colors of 50 state maps. With R, this is possible in just a few seconds!

  4. An Expansive ecosystem: There are several R packages that make it very easy to get spatial data, create static and interactive maps, and perform spatial analyses. This feature rich package ecosystem which all play nice together is frankly unmatched by other programming languages and even point and click tools like QGIS and ArcGIS. Some of these R packages include:

    • `sf`: For managing and analyzing spatial dataframes
    • `tigris`: For downloading in Census geographies
    • `ggplot2`: For making publication ready static maps
    • `urbnmapr`: For automatically adding Urban styling to static maps
    • `mapview`: For making expxploratory interactive maps
  5. Cost: Most point-and-click tools for geospatial analysis are proprietary and expensive. R is free opensource software. The software and most of its packages can be used for free by anyone for almost any use case.

Helpful Learning Resources

Basic Concepts


library(sf)

The sf library is a key tool for reading in, managing, and working with spatial data in R. sf stands for simple features (not San Francisco you Bay Area folks) and denotes a way to describe the spatial attributes of real life objects. The R object you will be working with most frequently for mapping is an sf dataframe. An sf dataframe is essentially a regular R dataframe, with a couple of extra features for use in mapping. These extra features exclusive to sf dataframes include:

  • sticky geometry columns
  • attached coordinate reference systems
  • some other spatial metadata

The most important of the above list is the sticky geometry column, which is a magical column that contains all of the geographic information for each row of data. Say for example you had a sf dataframe of all DC census tracts. Then the geometry column would contain all of the geographic points used to define DC census tract polygons. The stickiness of this column means that no matter what data munging/filtering you do, you will not be able to drop or delete the geometry column. Below is a graphic to help you understand this:

credits: @allisonhorst

And below is an example of reading in an sf dataframe and printing it to the R console.

# Read in spatial data about DC parks from DC Open Data Portal
dc_parks  = st_read("https://opendata.arcgis.com/api/v3/datasets/287eaa2ecbff4d699762bbc6795ffdca_9/downloads/data?format=geojson&spatialRefId=4326",
                    quiet = T)

# Select just a few columns for readability
dc_parks = dc_parks %>%
  select(NAME, geometry)

# Print to the console
dc_parks
## Simple feature collection with 256 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -77.11113 ymin: 38.81718 xmax: -76.91108 ymax: 38.98811
## Geodetic CRS:  WGS 84
## First 10 features:
##                            NAME                       geometry
## 1  Kingman and Heritage Islands MULTIPOLYGON (((-76.96566 3...
## 2               Bald Eagle Hill MULTIPOLYGON (((-77.01063 3...
## 3                   Fort Greble MULTIPOLYGON (((-77.01515 3...
## 4                  Fort Stanton MULTIPOLYGON (((-76.97946 3...
## 5              Town Center Park MULTIPOLYGON (((-77.01771 3...
## 6        Triangle Park RES 0179 MULTIPOLYGON (((-77.01512 3...
## 7            Walter Pierce Park MULTIPOLYGON (((-77.0456 38...
## 8                    Canal Park MULTIPOLYGON (((-77.00339 3...
## 9                    Canal Park MULTIPOLYGON (((-77.00339 3...
## 10                  Murch Field MULTIPOLYGON (((-77.07088 3...

Note that there is some spatial metadata such as the Geometry Type, Bounding Box, and CRS which shows up as a header before the actual contents of the dataframe.

Since sf dataframes operate similarly to regular dataframes, we can use all our familiar tidyverse functions for data wrangling, including select, filter, rename, mutate, group_by and summarize. The sf package also has many functions that provide easy ways to replicate common tasks done in other GIS software like spatial joins, clipping, and buffering. Almost all of the mapping and geospatial analysis methods described in this guide rely on you having an sf dataframe. So let’s talk about how to get one!

Importing spatial data

The first step to mapping in R is reading in your spatial data as an sf dataframe. There are many R packages that make this easy, depending on the kind of spatial data you want to read in.

For states and counties

We highly recommend using the library(urbnmapr) package, which was created by folks here at Urban to easily create state and county level maps. The get_urbn_map() function in the package allows you to read in spatial data on states and counties, with options to include territories. Importantly, it will also display AL and HI as insets on the map in accordance with the Urban Institute Data Visualization Style Guide. For information on how to install urbnmapr, see the GitHub repository.

Below is an example of how you would use urbnmapr to get an sf dataframe of all the states or counties in the US.

library(urbnmapr)

# Get state data
states <- get_urbn_map("states", sf = TRUE)

# Can also get county data
counties <- get_urbn_map("counties", sf = TRUE)

For other Census geographies

We recommend using the library(tigris) package, which allows you to easily download TIGER and other cartographic boundaries from the US Census Bureau. In order to automatically load in the boundaries as sf objects, run once per R session.

library(tigris) has all the standard census geographies, including census tracts, counties, CBSAs, ZCTAs, congressional districts, tribal areas, and more. It also includes other elements such as water, roads, and military bases. A couple of notes on library(tigris):

  • By default, libraray(tigris) will download large very large and detailed TIGER line boundary files. For thematic mapping, the smaller cartographic boundary files are a better choice, as they are clipped to the shoreline, generalized, and therefore usually smaller in size without losing too much accuracy. To load cartographic boundaries, use the cb = TRUE argument. If you are doing detailed geospatial analysis and need the most detailed shapefiles, then you should use the detailed TIGER line boundary files and set cb = FALSE.

  • library(tigris) functions also allow you to load boundaries for previous years, though they will default to the most recent year.

Below is an example of how you would use library(tigris) to get a sf dataframe of all Census tracts in Maryland for 2015.

library(tigris)

# Only need to set once per script
options(tigris_class = "sf")

md_zctas <- tracts(state = "MD",
                  cb = TRUE,
                  year = 2015)

Unlike library(urbnmapr), different functions are used to get geographic data for different geographic levels. For instance, the blocks() function will load census block group data, and the tracts() function will load tract data. Other functions include block_groups(), zctas() , and core_based_statistical_areas(). For the full list of supported geographies and functions, see the package vignette.

For folks interested in pulling in Census demographic information along with Census geographies, we recommend checking out the sister package to library(tigris): library(tidycensus). That package allows you to download in Census variables and Census geographic data simultaneously.

For boundaries outside the US

We recommend using the library(rnaturalearth) package, which is similar to library(tigris) but allows you to download and use boundaries beyond the US. Instead of setting class to sf one time per session as we did with library(tigris), you must set the returnclass = "sf" argument each time you use a function from the package. Below is an example of downloading in an sf dataframe of all the countries in the world.

library(rnaturalearth)

world <- ne_countries(returnclass = "sf")

ggplot() +
  geom_sf(data = world, mapping = aes())

Your own files

Shapefiles/GeoJSONS

Shapefiles and GeoJSONs are 2 common spatial file formats you will found out in the wild. library(sf) has a function called st_read which allows you to easily read in these files as sf dataframes. The only required argument is dsn or data source name. This is the filepath of the .shp file or the .geojson file on your local computer. For geojsons, dsn can also be a URL.

Below is an example of reading in a shapefile of fire stations in DC which is stored in mapping/data/shapefiles/.

library(sf)

# Print out all files in the directory
list.files("mapping/data/shapefiles")
## [1] "Fire_Stations.cpg" "Fire_Stations.dbf" "Fire_Stations.prj"
## [4] "Fire_Stations.shp" "Fire_Stations.shx" "Fire_Stations.xml"

Note that shapefiles are actually stored as 6+ different files inside a folder. You need to provide the fileapth to the file ending in .shp.

dc_firestations <- st_read(dsn = "mapping/data/shapefiles/Fire_Stations.shp",
                           quiet = T)

And now dc_firestations is an sf dataframe you can use in all your mapping needs! st_read supports reading in a wide variety of other spatial file formats, including geodatabases, KML files, and over 200 others. For an incomplete list, please see the this sf vignette.

CSVs or dataframes with lat/lons

If you have a CSV with geographic information stored in columns, you will need to read in the CSV as a regular R dataframe and then convert to an sf dataframe. library(sf) contains the st_as_sf() function for converting regular R dataframes into an sf dataframe. The two arguments you must specify for this function are:

  • coords: A length 2 vector with the names of the columns corresponding to longitude and latitude (in that order!). For example, c("lon", "lat").
  • crs: The CRS (coordinate references system) for your longitude/latitude coordinates. Remember you need to specify both the
    authority and the SRID code, for exmaple (“EPSG:4326”). For more information on finding and setttnig CRS codes, please see this section. (TODO: INSERT CRS LINK)

Below is an example of reading in data from a CSV and converting it to an sf dataframe.

library(sf)

# Read in dataset of state capitals which is stored as a csv
state_capitals <- read_csv("mapping/data/state-capitals.csv")

state_capitals <- state_capitals %>% 
  # Specify names of the lon/lat columns in the CSV to use to make geometry col
  st_as_sf(coords = c("longitude", "latitude"),
           crs = 4326)

One common mistake is that before converting to an sf dataframe, you must drop any rows that have NA values for latitude or longitude. If your data contains NA values, then the st_as_sf() function will throw an error.

Appending spatial info to your data

Oftentimes, the data you are working with will just have state or county identifiers - like FIPS codes or state abbreviations - but will not contain any geographic information. In this case, you must do the extra work of downloading in the geographic data as an sf dataframe and then joining your non-spatial data to the spatial data. Generally this involves 3 steps:

  1. Reading in your own data as a data frame
  2. Reading in the geographic data as an sf dataframe
  3. Using left_join to merge the geographic data with your own non spatial data and create a new expanded sf dataframe

Let’s say we had a dataframe on CHIP enrollment by state with state abbreviations.

# read the state CHIP data
chip_by_state <- read_csv("mapping/data/chip-enrollment.csv") %>% 
  # clean column names so there are no random spaces/uppercase letters
  janitor::clean_names()

# print to the console
chip_by_state %>% head()
## # A tibble: 6 × 3
##   state      chip_enrollment state_abbreviation
##   <chr>                <dbl> <chr>             
## 1 Alabama             150040 AL                
## 2 Alaska               15662 AK                
## 3 Arizona              88224 AZ                
## 4 Arkansas            120863 AR                
## 5 California         2022213 CA                
## 6 Colorado            167227 CO

In order to convert this to an sf dataframe, we need to read in the spatial boundaries for each state and append the magical geometry column to turn it into an sf dataframe. Here is how we do that with get_urbn_map() and left_join() .

library(urbnmapr)

# read in state geographic data from urbnmapr
states <- get_urbn_map(map = "states", sf = TRUE)

# left join state geographies to chip data
chip_with_geographies = states %>% 
  left_join( 
    chip_by_state, 
    # Specify join column, which are slightly differently named in states and chip 
    # respectively
    by = c("state_abbv" = "state_abbreviation"))

chip_with_geographies %>% 
  select(state_fips, state_abbv, chip_enrollment)
## Simple feature collection with 51 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -2600000 ymin: -2363000 xmax: 2516374 ymax: 732352.2
## Projected CRS: US National Atlas Equal Area
## First 10 features:
##    state_fips state_abbv chip_enrollment                       geometry
## 1          01         AL          150040 MULTIPOLYGON (((1150023 -15...
## 2          04         AZ           88224 MULTIPOLYGON (((-1386136 -1...
## 3          08         CO          167227 MULTIPOLYGON (((-786661.9 -...
## 4          09         CT           25551 MULTIPOLYGON (((2156197 -83...
## 5          12         FL          374884 MULTIPOLYGON (((1953691 -20...
## 6          13         GA          232050 MULTIPOLYGON (((1308636 -10...
## 7          16         ID           35964 MULTIPOLYGON (((-1357097 78...
## 8          18         IN          114927 MULTIPOLYGON (((1042064 -71...
## 9          20         KS           79319 MULTIPOLYGON (((-174904.2 -...
## 10         22         LA          161565 MULTIPOLYGON (((1075669 -15...

A common pitfall when doing left_join()s is that some records may be missing from either the left hand side dataframe (in this case states) or the right hand side dataframe (in this case chip). This can happen if a few states are missing in your data, or if your data has misspelled states that weren’t matched.

We highly recommend using libraray(tidylog) and the tidylog::left_join() function - instead of the regular left_join function - to print additional information on the number of rows that are unmatched, or rows only in the left hand side/right hand side dataframes. This will help you identify errors and mismatches between datasets early on in your data ingestion process. For example say we also wanted to merge in a dataset of state population totals onto our chip_with_geographies dataframe. If we use tidylog::left_join()

# Read in data on state populations from 2010
state_pops <- 
  read_csv("https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv",
           # Set this to disable printing column info to console
           col_types = cols()) %>% 
  filter(ages == "total", year == "2010") %>% 
  select(state_abbv   = `state/region`, population)

chip_with_geographies <- chip_with_geographies  %>% 
  # Specify left_join from tidylog to print summary messages
  tidylog::left_join(state_pops, by = "state_abbv")
## left_join: added one column (population)
##            > rows only in x    0
##            > rows only in y  ( 2)
##            > matched rows     51
##            >                 ====
##            > rows total       51

After some digging, we see that the two unmatched rows in the state_pops dataframe correspond to PR and the USA total, which we choose to safely disregard.

Coordinate Reference Systems

The short version

Just watch this video:

The Urban Institute Style Guide requires the use of the Atlas Equal Earth Projection ("ESRI:102003") for US maps. For state and local maps, use this handy guide to find an appropriate State Plane projection.

The long version

Coordinate reference systems (CRS) specify the 3d shape of the earth and optionally how we project that 3d shape onto a 2d surface. They are an important part of working with spatial data as you need to ensure that all the data you are working with are in the same CRS in order for spatial operations and maps to be accurate.

CRS can be specified either by name (ie Maryland State Plane) or Spatial Reference System IDentifier (SRID). THe SRID is a numeric identifier that uniquely identifies a coordinate reference system. Generally when referring to an SRID, you need to refer to an authority (ie the data source) and a unique ID. An example is EPSG:26985 which refers to the Maryland State plane projection from the EPSG, or ESRI:102003 which refers to the Atlas Equal Area projection from ESRI. Most CRS codes will be from the EPSG, and some from ESRI and others. A good resource for finding/validating CRS codes is epsg.io.

Sidenote - EPSG stands for the now defunct European Petroleum Survey Group. And while oil companies have generally been terrible for the earth, the one nice thing they did for the earth was to set up common standards for coordinate reference systems.

You might be thinking well isn’t the earth just a sphere? Why do we need all this complicated stuff? And the answer is well the earth is kind of a sphere, but it’s really more of a misshapen ellipsoid which is pudgier at the equator than at the poles. To visualize how coordinate reference systems work, imagine that the earth is a (lumpy) orange. Now peel the skin off an orange and try to flatten it. There are many ways to do it, but all will create distortions of some kind. The CRS will give us the formula we’ve used to specify the shape of the orange (usually a sphere or ellipsoid of some kind) and optionally, specify how we flattened the orange into 2d.

Broadly, there are two kinds of Coordinate Reference Systems:

  1. Geographic coordinate systems

    • (sometimes called unprojected coordinate systems)
    • Specifies a 3d shape for the earth
    • Uses a spheroid/ellipsoid to approximate shape of the earth
    • Usually use decimal degree units (ie latitude/longitude) to identify locations on earth
  2. Projected coordinate systems

    • Specifies a 3d shape for the earth + a 2d mapping

      • Is a geographic coordinate system + a projection

      • projection: mathematical formula used to convert a 3d coordinate system to a 2d flat coordinate system

      • Many different kinds of projections, including Equal Area, Equidistant, Conformal, etc

      • All projections distort the true shape of the earth in some way, either in terms of shape, area, or angle. Required xkcd comic

      • Usually use linear units (ie feet, meters) and therefore useful for distance based spatial operations (ie creating buffers)

Finding the CRS

If you are lucky, your data will have embedded CRS data that will be automatically detected when the file is read in. This is usually the case for GeoJSONS (.geojson) and shapefiles (.shp). When you use st_read() on these files, you should see the CRS displayed in the metadata:

You can also the st_crs() function to find the CRS associated with your sf dataframe. Note that the information returned is very long, but the key information is located at the end in ID[authority, SRID]. For example:

st_crs(dc_firestations)
## Coordinate Reference System:
##   User input: WGS 84 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["latitude",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["longitude",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     ID["EPSG",4326]]

Sometimes however, the CRS will be blank or NA as the dataset did not specify the CRS. In that case you MUST find and set the CRS for your data before proceeding with analysis. Below are some good rules of thumb for finding out what the CRS for your data is:

  • For geojsons, the CRS should always be EPSG:4326 (or WGS 84). The official geojson specification states that this is the only valid crs for geojsons, but in the wild, this may not be true 100% of the time.
  • For shapefiles, there should be a file that ends in .proj in the same directory as the .shp file. This file contains the projection information for that file.
  • For CSV’s with latitude/longitude columns, the CRS is usually EPSG:4326 (or WGS 84).
  • Look at the metadata and any accompanying documentation to see if the coordinate reference system for the data is specified

If none of the above rules of thumb apply to you, check out Whattheproj, or the crsuggest R package for help identifying the CRS in your data. Once you’ve identified the appropriate CRS, you can set the CRS for your data with st_crs():

# If you are certain that your data contains coordinates in the ESRI Atlas Equal Earth projections
some_sf_dataframe = st_crs("ESRI:102003")

Transforming the CRS

Often you will need to change the CRS for your sf dataframe. You can do this with st_transform:

# Tranforming CRS from WGS 84 to Urban required Equal Earth Projection
dc_firestations = dc_firestations %>% st_transform("ESRI:102003")

if you are working with multiple sf dataframes which have different CRS, you will need to transform the CRS to all be the same. Luckily, st_transform allows you to just use the CRS of another sf dataframe when transforming, so an easy way to do this would be

# transform CRS of df to be the same as CRS of dc_firestations
chip_with_geographies = chip_with_geographies %>% 
  st_transform(crs = st_crs(dc_firestations))

Mapping

In order to start mapping, you need an sf dataframe. If you don’t have one, see the Basic Concepts section above.

The basics

library(ggplot2)

Most mapping in R fits the same theoretical framework as plotting in R using library(ggplot2). To learn more about ggplot2, visit the Data Viz page or read the official ggplot book

The key function for mapping is the special geom_sf() geom which works with sf dataframes. This function magically detects whether you have point or polygon spatial data and displays the results on a map.

A simple map

To make a simple map, all you need to do is add a geom_sf() to a ggplot and set the data argument equal to an sf dataframe. Say for example you wanted to make a map of all 50 states using data from library(urbnmapr):

library(urbnmapr)

states <- get_urbn_map("states", sf = TRUE)

ggplot() +
  geom_sf(data = states, mapping = aes())

Note that we set no aes mappings as we currently do not want to apply any mappings (like coloring in the states by a variable or allowing the transparency to vary by a variable). We will cover how to set such aesthetic mappings later.

Styling

library(urbnthemes)

library(urbnthemes) automatically styles maps in accordance with the Urban Institute Data Visualization Style Guide. By using library(urbnthemes), you can create publication ready maps you can immediately drop in to Urban research briefs or blog posts.

To install urbnthemes, visit the package’s GitHub repository and follow the instructions.

You can use the automated theming functions in two ways: set the map defaults once per script, or add theme_urbn_map() to the end of each of your created maps. These functions will get rid of unnecessary axes, labels, and gridlines that are useful for charts, but not needed for maps.

library(urbnthemes)

# You can either run this once per script to automatically style all maps with
# the Urban theme
set_urbn_defaults(style = "map")

# Or you can add `+ theme_urbn_map()` to the end of every plot you make
ggplot() +
  geom_sf(states, mapping = aes()) +
  theme_urbn_map()

Layering

Just like in other GIS software, you can layer multiple geometries on top of each other using the + operator from library(ggplot2). The shapes will appear from bottom to top (ie the last mapped object will show up on top). It is important that all layers are in the same CRS (coordinate reference system).

state_capitals = state_capitals %>% 
  # This will change CRS to ESRI:102003 and shift the AK and HI state capitals
  # point locations to the appropriate locations on the inset maps.
  tigris::shift_geometry()

ggplot() +
  geom_sf(data = states,
          mapping = aes()) +
  geom_sf(data = state_capitals, 
          mapping = aes(),
          # urbnthemes library has urbn color palettes built in.
          color = palette_urbn_main['cyan'], 
          size = 2.0) +
  theme_urbn_map()

Fill and Outline Colors

The same commands used to change colors, opacity, lines, size, etc. in charts can be used for maps too. For instance, to change the colors of the map above, just use the fill = and color = parameters in geom_sf(). fill will change the fill color of polygons; color will change the color of polygon outlines, lines, and points.

library(urbnthemes) contains inbuilt. helper variables (like palette_urbn_main) for accessing color palettes from the Urban Data Viz Style guide. If for example you want states to be Urban’s magenta color:

ggplot() +
  geom_sf(states, mapping = aes(),
          # Adjust polygon fill color
          fill = palette_urbn_main['magenta'],
          # Adjust polygon outline color
          color = "white") +
  theme_urbn_map()

Adding text

You can also add text, like state abbreviations, directly to your map using geom_sf_text and the helper function get_urbn_labels().

ggplot() +
  geom_sf(states, mapping = aes(),
          color = "white") +
  theme_urbn_map() +
  # Generates dataframe of state abbv and appropriate location to plot them
  geom_sf_text(data = get_urbn_labels(map = "states", 
                                      sf = TRUE), 
                aes(label = state_abbv), 
            size = 3)

There’s also geom_sf_label() if you want labels with a border.

ggplot() +
  geom_sf(states, mapping = aes(),
          color = "white") +
  theme_urbn_map() +
  geom_sf_label(data = get_urbn_labels(map = "states", 
                                      sf = TRUE), 
                aes(label = state_abbv), 
            size = 3)

Urban Institute style

  • U.S. dot maps and U.S. choropleths use the Albers Equal-Area Conic Projection. Tile-based maps like tile grid maps and geofaceting use the Mercator projection.
  • Maps that show the magnitude of a variable use the blue sequential ramp and maps that display positives and negatives use the diverging color ramp.
  • In general, the borders between geographies are white.

Choropleth Maps

Choropleth maps display geographic areas with shades, colors, or patterns in proportion to a variable or variables. Choropleth maps can represent massive geographies like the entire world and small geographies like Census Tracts. To make a choropleth map, you need to set the fill argument equal to a variable in your data inside the aes() argument of geom_sf. As an example:

chip_with_geographies %>% 
  ggplot() +
  geom_sf(aes(fill = chip_enrollment), 
          # Set state outline color to white
          color = "white")

When we set an aesthetic mapping inside aes(), we are telling R that we want some aspect of the plot to change in accordance with a variable in our dataset. And because we want the fill color to vary, we use the fill aesthetic.

Continuous color scale

Let’s say we wanted to make a choropleth map of states colored in by CHIP enrollment as a percentage of total population (ie the chip_enrollment variable divided by the population variable in the chip_with_geographies dataframe). Below is the code we would use

# Calculate the chip enrollment percentage and append as a column
chip_with_geographies <- chip_with_geographies %>% 
  mutate(chip_pct = chip_enrollment/population)

# make chloropleth map
chip_with_geographies_map = ggplot() + 
  geom_sf(data = chip_with_geographies, 
          # we set the fill aesthetic to be a column in the data
          aes(fill = chip_pct),
          # set white boundaries
          color = "white") + 
  theme_urbn_map()

chip_with_geographies_map

And if we wanted to turn the text in the legend to be percentages, we can use the labels argument of the scale_fill_continuous function.

chip_with_geographies_map +
  scale_fill_continuous(labels = scales::percent_format())

Discrete color scale

You can also have discrete color bins (ie 0-1%, 1-2%, etc) for the legend using the scale_fill_steps() function.

chip_with_geographies_map +
  # Make the text discrete and format as percentages
  scale_fill_steps(labels = scales::percent_format())

Be careful when doing this as changing the number or size of bins can drastically change how the map looks. It might also be helpful to manually create your bins yourself by converting your continuous variables into factors. This will give you more fine grained control over exactly what the text looks like in the legend. Below is an example of discretizing the continuous variable youself using cut_interval().

format_interval = function(interval_text){
  
  # Remove open and close brackets which is math notation
  text = interval_text %>% 
    str_remove_all("\\(") %>% 
    str_remove_all("\\)") %>% 
    str_remove_all("\\[") %>% 
    str_remove_all("\\]") %>% 
    str_replace_all(",", " - ")
  
  # Convert decimal ranges to percent ranges
  text = text %>% 
    str_split(" - ") %>% 
    map(~as.numeric(.x) %>% scales::percent() %>% paste0(collapse = " - "))
  
  return(text)
}

chip_with_geographies %>% 
  add_row(chip_pct = 0) %>% 
  # cut_interval makes n groups with equal range
  mutate(chip_pct_interval = cut_interval(chip_pct, n = 5)) %>% 
  mutate(chip_pct_interval = format_interval(chip_pct_interval))
## Simple feature collection with 52 features and 8 fields (with 1 geometry empty)
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -3015029 ymin: -1509338 xmax: 2258200 ymax: 1565782
## Projected CRS: USA_Contiguous_Albers_Equal_Area_Conic
## First 10 features:
##    state_fips state_abbv  state_name       state chip_enrollment population
## 1          01         AL     Alabama     Alabama          150040    4785570
## 2          04         AZ     Arizona     Arizona           88224    6408790
## 3          08         CO    Colorado    Colorado          167227    5048196
## 4          09         CT Connecticut Connecticut           25551    3579210
## 5          12         FL     Florida     Florida          374884   18846054
## 6          13         GA     Georgia     Georgia          232050    9713248
## 7          16         ID       Idaho       Idaho           35964    1570718
## 8          18         IN     Indiana     Indiana          114927    6489965
## 9          20         KS      Kansas      Kansas           79319    2858910
## 10         22         LA   Louisiana   Louisiana          161565    4545392
##                          geometry    chip_pct chip_pct_interval
## 1  MULTIPOLYGON (((761146.2 -7... 0.031352587       2.2% - 3.2%
## 2  MULTIPOLYGON (((-1743454 -3... 0.013766093       1.1% - 2.2%
## 3  MULTIPOLYGON (((-1123223 20... 0.033126091       3.2% - 4.3%
## 4  MULTIPOLYGON (((1838898 619... 0.007138726       0.0% - 1.1%
## 5  MULTIPOLYGON (((1553947 -12... 0.019891909       1.1% - 2.2%
## 6  MULTIPOLYGON (((939223.1 -2... 0.023890052       2.2% - 3.2%
## 7  MULTIPOLYGON (((-1673717 95... 0.022896535       2.2% - 3.2%
## 8  MULTIPOLYGON (((687614.3 74... 0.017708416       1.1% - 2.2%
## 9  MULTIPOLYGON (((-511746.5 2... 0.027744490       2.2% - 3.2%
## 10 MULTIPOLYGON (((685021.8 -8... 0.035544789       3.2% - 4.3%

When mapping, you need to be very careful when setting color breaks for legends as slightly different breaks can generate very different looking maps

Interactive Maps

Interactive maps can be a great tool to explore and understand your data. And luckily there are a lot of new R packages that make it really easy to create them.

library(mapview)

library(mapview) is probably the most user friendly of the interactive mapping R libraries. All you have to do to create an interactive map is:

library(mapview)

mapview(chip_with_geographies)

A couple of things to note about this interactive map:

  - When you click on an object, you get a popup table of all it's attributes
  - When you hover over an object, you get a popup with an object id
  - The default basemap used is from Carto 

Each of the above behaviors can be changed if desired. As you’ll see in the below section, the syntax for library(mapview) is significantly different from libraray(ggplot2) so be careful!

Coloring in points/polygons

In order to create a chloropleth map where we color in the points/polygons by a variable, we need to feed in a column name in quotes to thezcol argument inside the mapview() function:

# Create interactive state map colored in by chip enrollment,
mapview(chip_with_geographies, zcol = "chip_enrollment")

If you want more granular control over the color palette for the legend can also feed in a vector of color hex codes to col.regions along with a column name to zcol. This will create a continuous color range along the provided colors. Be careful though as the color interpolation is not perfect.

library(RColorBrewer)
mapview(chip_with_geographies, 
        col.regions = c(palette_urbn_green[6], 
                        "white", 
                        palette_urbn_cyan[6]),
        zcol = "chip_enrollment") 

If you want to color in all points/polygons as the same color, just feed in a single color hex code to the col.regions argument:

mapview(chip_with_geographies, col.regions = palette_urbn_green[5]) 

Adding layers

You can add multiple sf objects on the same map by using the + operator. This is very useful when comparing 2 or more spatial datasets.

mapview(chip_with_geographies, col.regions = palette_urbn_green[5]) +
  mapview(state_capitals, col.regions = palette_urbn_cyan[5])

You can even create slider maps by using the | operator!

mapview(chip_with_geographies, col.regions = palette_urbn_green[5]) |
  mapview(state_capitals, col.regions = palette_urbn_cyan[5])

More details

To learn more about more advanced options with mapview maps, check out their documentation page and the reference manual.

There are also other interactive map making packages in R like leaflet (which mapview is a more user friendly wrapper of), tmap, and mapdeck. To learn about these other packages, this book chapter is a good starting point.

Dot Maps

Tile Grid Maps

Geofacetting

Cartograms

Spatial Operations

Calculating Distance

Spatial Joins

Cropping

Aggregating

Drive/Transit times

Geocoding

Geocoding is the process of turning text (usually addresses) into geographic coordinates (usually latitudes/longitudes) for use in mapping. For Urban researchers, we highly recommend using the Urban geocoder as it is fast, accurate, designed to work with sensitive/confidential data and most importantly free to use for Urban researchers! To learn about how we set up and chose the geocoder for the Urban Institute, you can read our Data@Urban blog.

Cleaning Addresses

The single most important factor in getting accurate geocoded data is having clean, well structured address data. This can prove difficult as address data out in the wild is often messy and unstandardized. While the rules for cleaning addresses are very data specific, below are some examples of clean addresses you should aim for in your data cleaning process:

library(gt)
cleaned_address_table  = tribble(
  ~"f_address", ~"Type of address",
  "123 Troy Drive, Pillowtown, CO, 92432", "residnetial address",
  "789 Abed Avenue, Apt 666, Blankesburg, CO, 92489", "residential apartment address",
  "Shirley Boulevard and Britta Drive, Blanketsburg, CO, 92489", "street intersection",
  "Pillowtown, CO", "city",
  "92489, CO", "Zip Code",
  
  )

gt(cleaned_address_table) %>% 
    # tab_header(title = md("Clean Address Examples")) %>% 
    opt_row_striping(row_striping = TRUE) %>% 
    tab_style(
    style = list(
        cell_text(weight = "bold")
        ),
      locations = cells_column_labels(
        columns = vars(f_address, `Type of address`)
      )) %>% 
    opt_align_table_header(align = c("left")) %>% 
    tab_options(
      container.width = "100%",
      container.height = "400px",
      # column_labels.background.color  = palette_urbn_cyan[1],
      table.border.top.width = 0,
      table.border.bottom.width = 0,
      column_labels.border.bottom.width= 0,
      

    )
f_address Type of address
123 Troy Drive, Pillowtown, CO, 92432 residnetial address
789 Abed Avenue, Apt 666, Blankesburg, CO, 92489 residential apartment address
Shirley Boulevard and Britta Drive, Blanketsburg, CO, 92489 street intersection
Pillowtown, CO city
92489, CO Zip Code

All that being said, our geocoder is pretty tolerant of different address formats, typos/spelling errors and missing state, zip codes, etc. So don’t spend too much time cleaning every address in the data. Also note that while our geocoder is able to geocode cities and zip codes, it will return the lat/lon of the center of the city/zip code, which may not be what you want.

Instructions

To use the Urban geocoder, you will need to:

  1. Generate a CSV with a column named f_address which contains the addresses in single line format (ie 123 Abed Avenue, Blanketsburg, CO, 94328). This means that if you have the addresses split across multiple columns (ie Address, City, State, Zip columns), you will need to concatenate them into one column. Also see our Address cleaning section above.

  2. Go to the Urban geocoder and answer the initial questions. This will tell you whether your data is non-confidential or confidential data, and allow you to upload your CSV for geocoding.

  3. Wait for an email telling you your results are ready. If your data is non-confidential, this email will contain a link to your geocoded results. This link expires in 24 hours, so make sure to download your data before then. If you data is confidential, the email will contain a link to the location on the Y Drive where your confidential geocoded data is stored. You can specify this output folder when submitting the CSV in step 1.

Geocoder outputs

The geocoded file will be your original data, plus a few more columns (including latitude and longitude). each of the new columns that have been appended to your original data. It’s very important that you take a look at the Addr_type column in the CSV before doing further analysis to check the accuracy of the geocoding process.

Column Description
Ma tch_addr The actual address that the inputted address was matched to. This is the address that the geocoder used to get Latitudes / Longitudes. If there are potentially many typos or non standard address formats in your data file, you will want to take a close look at this column to confirm that the matched address correctly handled typos and badly formatted addresses.
Longitude | The WGS 84 datum Longitude (EPSG code 4326)
Latitude The WGS 84 datum Latitude (EPSG code 4326)
A ddr_type The match level for a geocode request. This should be used as an indicator of the precision of geocode results. Generally, Subaddress, PointAddress, StreetAddress, and StreetInt represent accurate matches. The list below contains all possible values for this field. Green values represent High accuracy matches, yellow represents Medium accuracy matches and red represents Low accuracy/inaccurate matches. If you have many yellow and red values in your data, you should manually check the results before proceeding with analysis. All possible values:

Subaddress: A subset of a PointAddress that represents a house or building subaddress location, such as an apartment unit, floor, or individual building within a complex. The UnitName, UnitType, LevelName, LevelType, BldgName, and BldgType field values help to distinguish subaddresses which may be associated with the same PointAddress. Reference data consists of point features with associated house number, street name, and subaddress elements, along with administrative divisions and optional postal code; for example, 3836 Emerald Ave, Suite C, La Verne, CA, 91750.

PointAddress: A street address based on points that represent house and building locations. Typically, this is the most spatially accurate match level. Reference data contains address points with associated house numbers and street names, along with administrative divisions and optional postal code. The X / Y (Longitude/Latitude) and geometry output values for a PointAddress match represent the street entry location for the address; this is the location used for routing operations. The DisplayX and DisplayY values represent the rooftop, or actual, location of the address. Example: 380 New York St, Redlands, CA, 92373.

StreetAddress — A street address that differs from PointAddress because the house number is interpolated from a range of numbers. Reference data contains street center lines with house number ranges, along with administrative divisions and optional postal code information, for example, 647 Haight St, San Francisco, CA, 94117.

StreetInt: A street address consisting of a street intersection along with city and optional state and postal code information. This is derived from StreetAddress reference data, for example, Redlands Blvd & New York St, Redlands, CA, 92373.

StreetName: Similar to a street address but without the house number. Reference data contains street centerlines with associated street names (no numbered address ranges), along with administrative divisions and optional postal code, for example, W Olive Ave, Redlands, CA, 92373.

StreetAddressExt: An interpolated street address match that is returned when parameter matchOutOfRange=true and the input house number exceeds the house number range for the matched street segment.

DistanceMarker: A street address that represents the linear distance along a street, typically in kilometers or miles, from a designated origin location. Example: Carr 682 KM 4, Barceloneta, 00617.

PostalExt: A postal code with an additional extension, such as the United States Postal Service ZIP+4. Reference data is postal code points with extensions, for example, 90210-3841.

POI: —Points of interest. Reference data consists of administrative division place-names, businesses, landmarks, and geographic features, for example, Golden Gate Bridge.

Locality: A place-name representing a populated place. The Type output field provides more detailed information about the type of populated place. Possible Type values for Locality matches include Block, Sector, Neighborhood, District, City, MetroArea, County, State or Province, Territory, Country, and Zone. Example: Bogotá, COL,

PostalLoc: A combination of postal code and city name. Reference data is typically a union of postal boundaries and administrative (locality) boundaries, for example, 7132 Frauenkirchen.

Postal: Postal code. Reference data is postal code points, for example, 90210 USA.
Score A number from 1–100 indicating the degree to which the input tokens in a geocoding request match the address components in a candidate record. A score of 100 represents a perfect match, while lower scores represent decreasing match accuracy.
Status Indicates whether a batch geocode request results in a match, tie, or unmatched. Possible values include

M - Match. The returned address matches the input address and is the highest scoring candidate.

T - Tied. The returned address matches the input address but has the same score as one or more additional candidates.

U - Unmatched. No addresses match the inputtd address.
geometry The WKT (Well-known text) representation of the latitudes and longitudes. This column may be useful if you’re reading the CSV into R, Python, or ArcGIS
Region The state that Match_addr is located in
Re gionAbbr Abbreviated State Name. For example, CA for California
S ubregion The county that the input address is located in
M etroArea The name of the Metropolitan area that Match_addr is located in. This field may be blank if the input address is not located within a metro area.
City The city that Match_addr is located in
Nbrhd The Neighborhood that Match_addr is located in. Note these are ESRI defined neighborhoods which may or may not align with other sources neighborhood definitions

While interactive maps are a great tool to explore and understand your data, we do not recommend using them in Urban research briefs or blog posts. They are unfortunately pretty difficult to style in compliance with the Data Viz Style Guide and there are potential legal concerns with some of the basemaps not allowing free use for research purposes.

There are also legal concerns with using some of the common basemaps used in these interactive maps. Many are not open source and their terms of use prohbit them from being used for research purposes.

There are many packages in R that allow you to quickly create interactive maps. The two mai

11 –>